Genomic Signatures from DNA Word Graphs
Abstract
Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons and cis-elements through genes and on to regions of conserved or divergent gene order. The DNAWords program aims to identify mathematical structures that characterize genomes at multiple scales. This work focuses on the fine structure of genomic sequences: the manner in which short nucleotide sequences fit together to form the genome as an abstract sequence, studied in a graph-theoretic setting. A DNA word graph is a generalization of a de Bruijn graph that records the occurrence counts of nodes and edges in a genomic sequence. A DNA word graph can be derived from a genomic sequence generated by a finite Markov chain or from a subsequence of a sequenced genome. Both theoretically and empirically, DNA word graphs give rise to genomic signatures. Several genomic signatures are derived from the structure of a DNA word graph, including an information-rich and visually appealing genomic bar code. Application of genomic signatures to several genomes demonstrates their practical value in identifying and distinguishing genomic sequences.
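To make the idea of a count-annotated de Bruijn graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes that nodes are words of a fixed length k, that an edge joins each word to the word starting one position later in the sequence, and that the sequence is read on a single strand. The function name `dna_word_graph` and its parameters are hypothetical, chosen only for illustration.

```python
from collections import Counter


def dna_word_graph(sequence, k):
    """Return (node_counts, edge_counts) for words of length k.

    Assumption (not taken from the abstract): a node is a k-letter word,
    and an edge links the word at position i to the word at position i + 1,
    so the two words overlap in k - 1 letters.
    """
    seq = sequence.upper()
    if k < 1 or len(seq) < k:
        raise ValueError("k must be at least 1 and no longer than the sequence")

    # Occurrence count of every k-letter word (node).
    node_counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    # Occurrence count of every pair of consecutive words (edge).
    edge_counts = Counter(
        (seq[i:i + k], seq[i + 1:i + k + 1]) for i in range(len(seq) - k)
    )
    return node_counts, edge_counts


# Toy sequence chosen for illustration only.
nodes, edges = dna_word_graph("ACGTACGA", 2)
```

In this toy example the word AC occurs twice and is followed by CG both times, so both the node AC and the edge from AC to CG carry a count of 2. Any signature derived from such a graph would be computed from these two count tables.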
Similar resources
Word-Based Characterization of the Bidirectional Promoters from the Human DNA-Repair Pathway
A word-based genomic signature for a group of related genomic sequences is a set of characteristic subsequences. Unlike most existing genomic signatures, a word-based genomic signature provides insights that are directly applicable to the problem of identifying functional DNA elements. The effectiveness of the word-based genomic signature method is shown by analyzing promoter sequences for gene...
Examination of Genome Homogeneity in Prokaryotes Using Genomic Signatures
BACKGROUND: DNA word frequencies, normalized for genomic AT content, are remarkably stable within prokaryotic genomes and are therefore said to reflect a "genomic signature." The genomic signatures can be used to phylogenetically classify organisms from arbitrary sampled DNA. Genomic signatures can also be used to search for horizontally transferred DNA or DNA regions subjected to special select...
Resolving Prokaryotic Taxonomy without rRNA: Longer Oligonucleotide Word Lengths Improve Genome and Metagenome Taxonomic Classification
Oligonucleotide signatures, especially tetranucleotide signatures, have been used as a method for homology binning by exploiting an organism's inherent biases towards the use of specific oligonucleotide words. Tetranucleotide signatures have been especially useful in environmental metagenomics samples, as many of these samples contain organisms from poorly classified phyla which cannot be easily i...
Genomic Signatures in De Bruijn Chains
Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons to regions of conserved or divergent gene order. This work examines the unique manner in which oligonucleotides fit together to comprise a genome, within a graph-theoretic setting. A de Bruijn chain (DBC) is a generalization of a finite Markov chain. A DNA wo...
Direct Construction of Compact Directed Acyclic Word Graphs
The Directed Acyclic Word Graph (DAWG) is an efficient data structure to treat and analyze repetitions in a text, especially in DNA genomic sequences. Here, we consider the Compact Directed Acyclic Word Graph of a word. We give the first direct algorithm to construct it. It runs in time linear in the length of the string on a fixed alphabet. Our implementation requires half the memory space used by D...
Publication date: 2007